Information on Replication Materials for:
	Study Title: Replication data for: What's in a Name? A Method for Extracting Information about Ethnicity from Names
	Dataverse: Political Analysis
	DOI: doi:10.7910/DVN/27691 (v1)
	http://dx.doi.org/10.7910/DVN/27691


This document describes the code, functions, and replication data included in the replication archive. Please read the "notes" section below before running the code, as some of the analyses take a significant amount of computing time.

Replication Code:
makeEIfigure.R: Processes election results and names data; implements Greiner-Quinn ecological inference; and creates figure 2. Outputs eiPlotOutputFinal.Rdata.
makeKuresoiPlot.R: Estimates changes in ethnic proportions at the polling station level for the 138 polling stations in Kuresoi constituency between 2007 and 2010 using the new method and a dictionary. Produces the plot shown in figure 3.
runMCsims.r: Produces the Monte Carlo tables in the paper and results available in the online appendix.

Functions:
mc.table.R: Function to make Monte Carlo tables.
nameEst.R: Function for the basic name estimation approach based on Pr(Name|Group).
nameEstW.R: Function for the name estimation approach based on Pr(Name|Group), with inverse exponential weighting based on Cook's distance.
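
As a rough illustration of the idea behind these two functions, the sketch below regresses observed name frequencies on a matrix of Pr(Name|Group) and then refits with weights of exp(-Cook's distance). This is not the code in nameEst.R or nameEstW.R: the regression formulation, the exact weighting formula, and every object name (cond, obs_freq, est_basic, est_weighted) are assumptions made for the example, and the data are simulated.

```r
# Toy illustration only: all objects below are made up, not taken from the archive.
set.seed(1)

# Hypothetical conditional matrix: rows are names, columns are groups.
# Each column is Pr(Name | Group) and sums to 1.
cond <- cbind(
  groupA = c(0.50, 0.30, 0.15, 0.05),
  groupB = c(0.05, 0.15, 0.30, 0.50)
)
rownames(cond) <- c("name1", "name2", "name3", "name4")

# Simulate a list of 2000 names from a population that is 70% groupA.
true_share <- c(groupA = 0.7, groupB = 0.3)
name_probs <- as.vector(cond %*% true_share)
draws <- sample(rownames(cond), size = 2000, replace = TRUE, prob = name_probs)
obs_freq <- as.vector(table(factor(draws, levels = rownames(cond)))) / length(draws)

# Basic idea (assumed): regress observed name frequencies on the conditionals
# without an intercept, so the coefficients estimate the group shares.
fit <- lm(obs_freq ~ 0 + cond)
est_basic <- coef(fit)

# Weighted variant (assumed): down-weight influential names with exp(-Cook's distance).
w <- exp(-cooks.distance(fit))
fit_w <- lm(obs_freq ~ 0 + cond, weights = w)
est_weighted <- coef(fit_w)

print(round(est_basic, 3))
print(round(est_weighted, 3))
```

In this toy setup both sets of coefficients should land near the simulated shares; the functions in the archive may differ in how they constrain, weight, or normalize the estimates.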

Replication Data:
countyEthnicTurnout.Rdata: Black turnout, number, and percentage of registered voters in each county in North Carolina for the 2012 presidential elections. Used to generate figure 2.
allPrecinctsByCounty.Rdata: List containing 100 data.frames, one for each NC county. Each data.frame contains the surname, race, and precinct of registered voters in each county. Used to generate figure 2.
precinctTO.Rdata: Precinct level turnout in 2012 presidential elections in NC for ecological inference.
eiPlotOutputFinal.Rdata: R workspace image containing objects for plotting figure 2. Output by makeEIfigure.R.
kuresoiVRs.Rdata: Contains two lists of voter names at the polling station level for 2007 (ps07) and 2010 (ps10) for Kuresoi constituency. Also contains a data.frame with data on changes in polling station size and percent Kikuyu, used for ordering polling stations in the plot. For figure 3.
freqsUS.Rdata: Raw frequencies of US data, used for generating US name conditionals. For figure 2 and MC simulations.
nameConditionals.Rdata: Matrix representing the conditional distribution of names given ethnic groups in Kenya. For figure 3 and MC simulations.
dict.Rdata: Dictionary of Kenyan names for Kuresoi analysis. For figure 3.
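
Because the .Rdata files restore objects under their saved names, it can help to load them into a separate environment and inspect what they contain before running the scripts. The sketch below demonstrates this with a temporary toy file; the object it saves is a made-up stand-in that borrows the name ps07 only for illustration, and the archive's actual files would be loaded by passing their paths to load() instead.

```r
# Toy file standing in for one of the archive's .Rdata files.
tmp <- tempfile(fileext = ".Rdata")
ps07 <- list(station1 = c("nameA", "nameB"))  # made-up contents
save(ps07, file = tmp)

# Load into a fresh environment so nothing in the workspace is overwritten.
e <- new.env()
loaded <- load(tmp, envir = e)  # load() returns the names of the restored objects

print(loaded)   # names of the objects stored in the file
str(e$ps07)     # structure of one restored object
```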
  

  
Notes: 
When possible, analysis and estimation make use of the foreach package, enabling the use of multiple cores.
The ecological inference analysis in makeEIfigure.R employs a single chain of 5.5e6 draws from the posterior for each county, resulting in long run times. Two tips to reduce run times: (a) use multiple cores so that multiple counties can run simultaneously; (b) if possible, compile R with an optimized BLAS/LAPACK to speed matrix operations. I used the Accelerate framework native to Mac OS X. Results in the ecological inference analysis may vary from run to run because the chain mixes slowly for the names estimates from the full conditional. The slow mixing is due to attenuation bias in those estimates, which leads to relatively low inter-precinct variation in racial composition.
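
For readers unfamiliar with foreach, the following minimal sketch shows one common way to register a parallel backend so that independent tasks (for example, one per county) run on separate cores. It assumes the foreach and doParallel packages are installed; the choice of doParallel as the backend, the core count, and the function run_one_county are assumptions for illustration and are not taken from the replication scripts.

```r
library(foreach)
library(doParallel)

# Register a small local cluster; adjust the core count to your machine.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Hypothetical stand-in for a per-county analysis.
run_one_county <- function(i) {
  set.seed(i)          # fixed seed per task so reruns give the same toy result
  mean(rnorm(1000))
}

# Each iteration is independent, so iterations can run on different workers.
results <- foreach(i = 1:4, .combine = c) %dopar% run_one_county(i)

parallel::stopCluster(cl)
print(results)
```

If no backend is registered, foreach runs %dopar% loops sequentially and issues a warning, so the scripts should still complete on a single core, only more slowly.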

To speed computation of the MC results, reduce the number of trials ("nsims" in the code) from 5000 to a smaller number.
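
When choosing a smaller number of trials, keep in mind that Monte Carlo error shrinks roughly with the square root of the number of trials. The toy sketch below illustrates that trade-off; the simulated statistic is a made-up stand-in (standard normal draws), and mc_se is a hypothetical helper, not a function from the archive.

```r
set.seed(42)

# Monte Carlo standard error of a mean over nsims trials.
mc_se <- function(nsims) {
  draws <- rnorm(nsims)  # hypothetical stand-in for one statistic per trial
  sd(draws) / sqrt(nsims)
}

se_5000 <- mc_se(5000)
se_500  <- mc_se(500)

# Cutting trials by a factor of 10 inflates the standard error by about sqrt(10).
print(c(se_5000 = se_5000, se_500 = se_500, ratio = se_500 / se_5000))
```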

